A DuckDB Interface to Orbis
2026-05-28
Why MoodyDuck?
Orbis delivers data as large TSV files
Typical workflow (before)
1. Receive BvD flatfiles
2. Try to load TSV into Stata / R
3. Wait
4. Run out of memory
5. Subset the TSV manually
6. Reload
7. Repeat for every researcher and project
Note
But why not just use the Orbis User Interface?
MoodyDuck packages all Orbis flatfiles into a single
high-performance database file.
📁 One file: orbis_202506.duckdb
on GBP, Area & Cobra Workstations
⚡ Fast and memory efficient: query data without loading it into memory
🔗 No server: Runs inside your R or Python session directly
An analytical processing database built for data science
dplyr (R) and pandas/polars (Python)What’s in MoodyDuck?
💰 Financial Statements financials_industry · _eur · _usd
financials_banks · financials_insurances
key_financials · _eur · _usd
12 tables
🏢 Firm Demographics entities · identifiers · names
legal_info · status_history
contact_info · addresses· overviews
8 tables
🔗 Ownership Links links_current
links_2007 through links_2024
19 tables
🏭 Industry industry_classifications
NACE Rev. 2 · SIC · NAICS
1 table
👥 Advisors auditors_current
bankers_current/previous
dmc_current/previous
5 tables
📈 Market stock_indexes
1 table
bvd_id_numberEvery table uses bvd_id_number as the firm identifier.
The first two characters encode the country iso code of the company:
| Prefix | Country |
|---|---|
DE |
Germany |
FR |
France |
GB |
United Kingdom |
US |
United States |
NL |
Netherlands |
Country filtering is therefore straightforward.
How can you access MoodyDuck?
GBP Workstation
Remote Desktop access
D:\OrbisDB\Data\duckdb\
Area Workstation
Remote Desktop access
Q:\Users\Richard Winter\ OrbisDB\Data\duckdb\
Cobra Workstation
On-site · Room SO 104
E:\MoodyduckDB\Data\duckdb\
Win + R → mstsc)GBP Workstation
D:\OrbisDB\Data\duckdb\orbis_202506.duckdb
Area Workstation
Q:\Users\Richard Winter\OrbisDB\Data\duckdb\orbis_202506.duckdb
WS Admin: pascal.schrader@uni-mannheim.de
Room SO 104: Chair of Taxation, Accounting, and Finance, Center for Corporate Behavior and Regulation Analysis (COBRA)
E:\MoodyduckDB\Data\duckdb\orbis_202512.duckdb
⚠️ Cobra runs a different vintage (orbis_202512) than GBP and Area (orbis_202506). Do not mix vintages across scripts in the same project.
WS Admin: matti.boie-wegener@uni-mannheim.de
Getting Started
⚠️ Always use read_only = TRUE.
One write-mode connection locks the file for all users on the server.
Querying the Database
Queries are not executed until you ask for the result.
Use show_query(query) to see the generated SQL.
💡 Financial files are not firm-year panels. A firm may have multiple entries across different years (e.g., due to different consolidation levels, filing types, etc.).
fin <- tbl(con, "financials_industry_eur") |>
filter(
substr(bvd_id_number, 1, 2) == "DE",
financial_year %in% 2020:2022,
consolidation_code %in% c("U1", "U2")
) |>
distinct(bvd_id_number, financial_year, .keep_all = TRUE)
ind <- tbl(con, "industry_classifications") |>
filter(!is.na(nace_rev_2_main_section)) |>
select(bvd_id_number, nace_rev_2_main_section)
leg <- tbl(con, "legal_info") |>
select(bvd_id_number, status, standardised_legal_form)
df <- fin |>
left_join(ind, by = "bvd_id_number") |>
left_join(leg, by = "bvd_id_number") |>
collect()query = """
WITH fin AS (
SELECT DISTINCT ON (bvd_id_number,
financial_year) *
FROM financials_industry_eur
WHERE SUBSTR(bvd_id_number,1,2) = 'DE'
AND financial_year BETWEEN 2020 AND 2022
AND consolidation_code IN ('U1','U2')
),
ind AS (
SELECT
bvd_id_number, nace_rev_2_main_section
FROM industry_classifications
WHERE nace_rev_2_main_section IS NOT NULL
),
leg AS (
SELECT bvd_id_number, status,
standardised_legal_form
FROM legal_info
)
SELECT f.*, i.nace_rev_2_main_section,
l.status, l.standardised_legal_form
FROM fin f
LEFT JOIN ind i USING (bvd_id_number)
LEFT JOIN leg l USING (bvd_id_number)
"""
df = con.execute(query).df()Aggregate inside DuckDB — don’t collect first and then summarise in R/Python.
tbl(con, "financials_industry_eur") |>
filter(
substr(bvd_id_number, 1, 2) == "DE",
financial_year %in% 2020:2022,
consolidation_code %in% c("U1", "U2")
) |>
group_by(financial_year) |>
summarise(
n_firms = n(),
med_assets = median(total_assets, na.rm = TRUE),
mean_emp = mean(number_of_employees, na.rm = TRUE)
) |>
collect()con.execute("""
SELECT
financial_year,
COUNT(*) AS n_firms,
MEDIAN(total_assets) AS med_assets,
AVG(number_of_employees) AS mean_emp
FROM financials_industry_eur
WHERE SUBSTR(bvd_id_number, 1, 2) = 'DE'
AND financial_year BETWEEN 2020 AND 2022
AND consolidation_code IN ('U1', 'U2')
GROUP BY financial_year
""").df()The links_YYYY tables record subsidiary → shareholder pairs with type of relation and ownership percentages.
links <- tbl(con, "links_2022") |>
filter(
substr(shareholder_bvd_id, 1, 2) == "DE",
ownership_percentage >= 50
) |>
select(shareholder_bvd_id, subsidiary_bvd_id, ownership_percentage)
sub_info <- tbl(con, "legal_info") |>
select(bvd_id_number, country_iso_code, status) |>
rename(subsidiary_bvd_id = bvd_id_number)
result <- links |>
left_join(sub_info, by = "subsidiary_bvd_id") |>
collect()# R-native (fast, preserves types)
saveRDS(df, "data/sample.rds")
# Parquet — recommended for sharing
arrow::write_parquet(df, "data/sample.parquet")
# Export to Stata
haven::write_dta(df, "data/sample.dta")
# Write Parquet directly from DuckDB
DBI::dbExecute(
con,
"
COPY (SELECT ...) TO 'data/de.parquet'
(FORMAT PARQUET)
"
)# Parquet — recommended
df.to_parquet("data/sample.parquet",
index=False)
# CSV
df.to_csv("data/sample.csv", index=False)
# Export to Stata
import pyreadstat
pyreadstat.write_dta(df, "data/sample.dta")
# Write Parquet directly from DuckDB
con.execute("""
COPY (SELECT ...) TO 'data/de.parquet'
(FORMAT PARQUET)
""")💡 Prefer Parquet. Smaller than CSV, faster to read, preserves types. Readable by R, Python, Stata 18+, and DuckDB.
Shared Server Resource Management: Don’t make everybody angry with you
MoodyDuck runs on shared Windows Server workstations.
By default, DuckDB claims:
Since you’re not the only one using these resources this may lead to:
⚠️ memory_limit controls DuckDB’s internal buffer only — not the memory used when R/Python stores the collected result. Only collect the rows and columns you actually need.
When DuckDB exceeds memory_limit, it spills intermediate results to a temp directory.
Default: C:\Windows\Temp — often slow, shared, limited space.
Redirect to a fast dedicated location:
If queries are spilling often…
Don’t just raise memory_limit.
Instead:
con <- DBI::dbConnect(
duckdb::duckdb(),
dbdir = "D:/OrbisBackup/orbis_202506.duckdb",
read_only = TRUE
)
tryCatch(
{
DBI::dbExecute(con, "SET memory_limit = '20GB'")
DBI::dbExecute(con, "SET threads = 4")
# ... your analysis ...
},
finally = {
DBI::dbDisconnect(con, shutdown = TRUE)
# ↑ runs even if an error occurs
}
)⚠️ Multiple read-only connections are fine. One write-mode connection locks the file for everyone.
library(duckdb)
library(DBI)
library(tidyverse)
con <- DBI::dbConnect(
duckdb::duckdb(),
dbdir = "D:/OrbisBackup/orbis_202506.duckdb",
read_only = TRUE
)
DBI::dbExecute(con, "SET memory_limit = '20GB'") # ← adjust
DBI::dbExecute(con, "SET threads = 4") # ← adjust
DBI::dbExecute(con, "SET temp_directory = 'D:/DuckDBTemp'")
# --- your analysis here ---
DBI::dbDisconnect(con, shutdown = TRUE)import duckdb
import pandas as pd
con = duckdb.connect(
database = "D:/OrbisBackup/orbis_202506.duckdb",
read_only = True
)
con.execute("SET memory_limit = '20GB'") # ← adjust
con.execute("SET threads = 4") # ← adjust
con.execute("SET temp_directory = 'D:/DuckDBTemp'")
# --- your analysis here ---
con.close()On Area and GBP workstations, connect to the MoodyDuck Hub via localhost:4239 in your web browser.
Access to slides and documentation.
Access to interactive Graphical User Interface to Query & Explore Company Info, Financials, Industry, and Ownership Links.
Note
The MoodyDuck Hub and the GUI App are still under development and may be unavailable at times when new updates are being rolled out. Best used for small queries and quick lookups, not for heavy data extraction.
Feel free to reach out if you want to contribute to this project or for questions on any of the following:
📦 Orbis & MoodyDuck: data coverage, table structure, access, updates
🤖 AI tools: Claude Code, LLM workflows
MoodyDuck · Orbis via DuckDB